ABSTRACT

XML data warehouses form an interesting basis for decision- 
support applications that exploit heterogeneous data from 
multiple sources. However, XML-native database systems
currently suffer from limited performance in terms of manageable
data volume and response time for complex analytical
queries. Fragmenting and distributing XML data
warehouses (e.g., on data grids) makes it possible to address both
these issues. In this paper, we work on XML warehouse fragmentation.
In relational data warehouses, several studies recommend
the use of derived horizontal fragmentation. Hence,
we propose to adapt it to the XML context. We particularly 
focus on the initial horizontal fragmentation of dimensions' 
XML documents and exploit two alternative algorithms. We 
experimentally validate our proposal and compare these al- 
ternatives with respect to a unified XML warehouse model 
we advocate for. 
1. INTRODUCTION

Decision-support applications currently exploit more and 
more heterogeneous data from various sources. In this con- 
text, the extensible Markup Language (XML) is becoming a 
standard for representing complex business data 3 and can 
greatly help in their integration, warehousing and analysis. 
Many efforts toward XML data warehousing have indeed 
been achieved in the past few years O [21], as well as ef- 
forts for extending the XQuery language with near On-Line 
Analytical Processing (OLAP) capabilities such as advanced 
grouping and aggregation features 3 . This research notably 
aims at taking into account specificities of XML data (e.g., 
heterogeneous number and order of dimensions or complex 
measures in facts, ragged dimension hierarchies, etc.) that 
would be intricate to handle in a relational environment. 

XML-native database management systems (DBMSs) supporting
XQuery should form the basic storage component
of XML warehouses. However, they currently perform poorly
when dealing with the large data volumes and
complex analytical queries that are typical in data warehousing.
Distributing a warehouse on a grid-like network
can help improve storage and query performance.
Such a framework indeed provides both computing power 
and distributed storage resources. Thus, it can be used to 
handle large data warehouses efficiently. 

Traditionally, the distribution process starts with data 
fragmentation. Fragmentation consists of splitting a data
set into two or more parts (fragments) such that the combination
of the fragments yields the original warehouse without
any loss or addition of information. In the relational
context, derived horizontal fragmentation is acknowledged 
as best-suited to data warehouses, because it takes decision- 
support query requirements into consideration and avoids 
computing unnecessary join operations ' 2 . Several approaches 
have also been proposed for XML data fragmentation, but 
they do not take data warehouse multidimensional architec- 
tures (i.e., star-like schemas) into account. 

In this paper, we thus propose to adapt derived horizon- 
tal fragmentation techniques developed for relational data 
warehouses to the XML context. We particularly focus on 
the initial horizontal fragmentation of dimensions and adapt 
and compare the two major algorithms that address this issue:
the predicate construction [17] and the affinity-based
[15] strategies.

Adapting these relational techniques to XML warehouses
requires a well-defined XML warehouse model. Unfortunately,
although XML warehouse architectures from the literature
share many concepts (mostly originating from classical
data warehousing), they all differ from one another.
Hence, as a secondary contribution of this paper, we pro- 
pose a unified, reference XML data warehouse model that 
synthesizes and enhances existing models, and on which we 
can base our fragmentation work. 

The remainder of this paper is organized as follows. First,
we introduce the state of the art regarding XML data warehouses,
as well as our own reference XML warehouse model
(Section 2). Then, we present general definitions about fragmentation
and discuss existing research related to relational
data warehouse and XML data fragmentation (Section 3).
We detail the specifics of our adaptation to XML data warehouse
fragmentation (Section 4) and experimentally demonstrate
that proper fragmentation significantly reduces the
execution time of analytical XQueries (Section 5). We finally
conclude this paper and hint at future research directions
(Section 6).

2. XML DATA WAREHOUSING 

2.1 Related work 

Several studies address the issue of designing and build- 
ing XML data warehouses. They propose to use XML doc- 
uments to manage or represent facts and dimensions. The 
main objective of these approaches is to enable a native stor- 
age of the warehouse and its easy interrogation with XML 
query languages. 

Pokorny models an XML-star schema in XML by defining
dimension hierarchies as sets of logically connected collections
of XML data, and facts as XML data elements
[21]. Hümmer et al. propose a family of templates, named
XCube, enabling the description of a multidimensional structure
(dimension and fact data) for integrating several data
warehouses into a virtual or federated warehouse. Rusu
et al. propose a methodology, based on the XQuery tech- 
nology, for building XML data warehouses. This method- 
ology covers processes such as data cleaning, summariza- 
tion, intermediating XML documents, updating/linking ex- 
isting documents and creating fact tables [22]. Facts and 
dimensions are represented by XML documents built with 
XQueries. Park et al. introduce a framework for the multidi- 
mensional analysis of XML documents, named XML-OLAP 
[20]. XML-OLAP is based on an XML warehouse where every
fact and dimension is stored as an XML document. The
proposed model features a single repository of XML docu- 
ments for facts and multiple repositories of XML documents 
for dimensions (one repository per dimension). Finally,
Boussaid et al. propose an XML-based methodology, named 
X- Warehousing, for warehousing complex data [Tj . They use 
XML Schema as a modeling language to represent user anal- 
ysis needs. 

2.2 XML data warehouse reference model 



The studies enumerated in Section 2.1, though all different,
more or less converge toward a unified XML warehouse
model. They mostly differ in the way dimensions are
handled and the number of XML documents that are used
to store facts and dimensions. A performance evaluation
study of these different representations showed that representing
facts in one single XML document and each dimension
in one XML document yields the best performance [6].
Moreover, this representation makes it possible to model constellation
schemas without duplicating dimension information. Several
fact documents can indeed share the same dimensions.
Furthermore, since each dimension and its hierarchical lev- 
els are stored in one XML document, dimension updates 
are more easily and efficiently performed than if dimensions 
were either embedded with the facts or all stored in one 
single document. 

Hence, we adopt this architecture model. More precisely,
our reference data warehouse is composed of the following
XML documents (Definition 1):

1. dw-model.xml, which represents warehouse metadata; the
XML graph representing warehouse metadata is denoted
$G_{dw\text{-}model}$;

2. a set of $facts_f$.xml documents that each store information
related to a set of facts $f$;

3. a set of $dimension_d$.xml documents that each store a
given dimension $d$'s member values.

Definition 1. An XML document is defined as a graph 
(XML graph) whose nodes represent document elements or 
attributes, and edges represent the element / sub-element (or 
parent-child) relationship. Edges are labeled with element or 
attribute names. 

A $facts_f$.xml document stores facts (Figure 1(a)). The
document root node, FactDoc, is composed of fact subelements
that each instantiate a fact, i.e., measure values and
dimension references. These identifier-based references support
the fact-to-dimension relationship. The XML graph
representing fact set $f$ is denoted $G_{facts_f}$.

A $dimension_d$.xml document helps instantiate one dimension,
including any hierarchical level (Figure 1(b)). Its root
node, dimension, is composed of Level nodes. Each one
defines a hierarchy level composed of instance nodes that
each define the level's member attribute values. In addition,
an instance element contains Roll-up and Drill-Down
attributes that define the hierarchical relationship within
dimension $d$. The XML graph representing dimension $d$ is
denoted $G_{dimension_d}$.




Figure 1: factsf.xml (a) and dimensiond.xml (b) 
graph structure 



3. DATABASE FRAGMENTATION 
3.1 Definition 

There are three fragmentation types in the relational con- 
text [2]: vertical fragmentation, horizontal fragmentation 
and hybrid fragmentation. 

Vertical fragmentation splits a relation R into sub-relations 
that are projections of R with respect to a subset of at- 
tributes. It consists in grouping together attributes that 
are frequently accessed by queries. Vertical fragments are 
built by projection. The original relation is reconstructed 
by joining the fragments. 

Horizontal fragmentation divides a relation into subsets 
of tuples using query predicates. It reduces query process- 
ing costs by minimizing the number of irrelevant accessed 
instances. Horizontal fragments are built by selection. The
original relation is reconstructed by fragment union. A variant,
derived horizontal fragmentation, consists in partitioning
a relation with respect to predicates defined on another
relation.

Finally, hybrid fragmentation consists of either horizontal 
fragments that are subsequently vertically fragmented, or 
vertical fragments that are subsequently horizontally frag- 
mented. 
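
The three building blocks above can be illustrated on toy relations. The following Python sketch is only an illustration: the relations, attribute names and helper functions (`vertical_fragment`, `horizontal_fragment`, `derived_horizontal_fragment`, `join_on`) are hypothetical and do not come from any of the cited systems.

```python
# Toy relations: each tuple is a dict. All names and values are illustrative.
customer = [
    {"c_id": 1, "c_name": "Ann", "c_nation": 13},
    {"c_id": 2, "c_name": "Bob", "c_nation": 17},
    {"c_id": 3, "c_name": "Eve", "c_nation": 13},
]
sales = [
    {"s_id": 10, "c_id": 1, "amount": 50},
    {"s_id": 11, "c_id": 2, "amount": 20},
    {"s_id": 12, "c_id": 3, "amount": 75},
    {"s_id": 13, "c_id": 2, "amount": 5},
]


def vertical_fragment(relation, key, attributes):
    """Projection on key + attributes; the key is kept so fragments can be re-joined."""
    return [{a: t[a] for a in (key, *attributes)} for t in relation]


def horizontal_fragment(relation, predicate):
    """Selection: keep the tuples that satisfy the predicate."""
    return [t for t in relation if predicate(t)]


def derived_horizontal_fragment(relation, foreign_key, owner_fragment, owner_key):
    """Semi-join: keep tuples whose foreign key references a tuple of owner_fragment."""
    keys = {t[owner_key] for t in owner_fragment}
    return [t for t in relation if t[foreign_key] in keys]


def join_on(key, left, right):
    """Reconstruct a vertically fragmented relation by joining on the key."""
    index = {t[key]: t for t in right}
    return [{**t, **index[t[key]]} for t in left]


# Horizontal fragmentation of the dimension-like relation.
cust_13 = horizontal_fragment(customer, lambda t: t["c_nation"] == 13)
cust_other = horizontal_fragment(customer, lambda t: t["c_nation"] != 13)

# Derived horizontal fragmentation of the fact-like relation.
sales_13 = derived_horizontal_fragment(sales, "c_id", cust_13, "c_id")
sales_other = derived_horizontal_fragment(sales, "c_id", cust_other, "c_id")

# Vertical fragmentation of the dimension-like relation.
names = vertical_fragment(customer, "c_id", ["c_name"])
nations = vertical_fragment(customer, "c_id", ["c_nation"])
```

In this sketch, the union of `sales_13` and `sales_other` gives back `sales`, and `join_on("c_id", names, nations)` gives back `customer`, which mirrors the reconstruction rules stated above.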

3.2 Data warehouse fragmentation 

Many research studies address the issue of fragmenting 
relational data warehouses either to efficiently process ana- 
lytical queries or to distribute the warehouse. 

To improve ad-hoc query performance, Datta et al. exploit
a vertical fragmentation of facts to build the Curio index
[8], while Golfarelli et al. apply the same fragmentation on
warehouse views [10]. Bellatreche and Boukhalfa apply horizontal
fragmentation to a star schema [2]. Their fragmentation
strategy is based on a query workload and exploits
a genetic algorithm to select an optimal partitioning schema
that minimizes query cost. Finally, Wu and Buchmann recommend
combining horizontal and vertical fragmentation
for query optimization [24]. A fact table can be horizontally
partitioned according to one or more dimensions; it can also
be vertically partitioned according to its dimension foreign
keys.

To distribute a data warehouse, Noaman et al. exploit a
top-down strategy that uses horizontal fragmentation [17].
The authors propose an algorithm for deriving horizontal
fragments from the fact table based on queries that are defined
on all dimension tables. Finally, Wehrle et al. propose
to distribute and query a warehouse on a computing grid
[23]. They use derived horizontal fragmentation to split the
data warehouse and build a so-called block of chunks, a data
set defining a fragment.

In summary, these proposals generally exploit derived horizontal
fragmentation to reduce irrelevant data access rate
and efficiently process join operations across multiple relations
[2, 17, 23]. In the literature, the prevalent methods
used for derived horizontal fragmentation are the following.

• Predicate construction. This method fragments a
relation by using a complete and minimal set of predicates
[17]. Completeness means that two relation instances
belonging to the same fragment have the same
probability of being accessed by any query. Minimality
guarantees that there is no redundancy in predicates.

• Affinity-based fragmentation. This method is an
adaptation of vertical fragmentation methods to horizontal
fragmentation [15]. It is based on the predicate
affinity concept [25], where affinity defines query frequency.
Specific matrices (predicate usage and affinity
matrices) are exploited to cluster selection predicates.
A cluster is defined as a selection predicate cycle and
forms a dimension graph fragment.

3.3 XML database fragmentation 

Recently, several fragmentation techniques for XML data 
have been proposed. They split an XML document into a 
new set of XML documents. Their main objective is either 
to improve XML query performance [5] or to distribute or 
exchange XML data over a network [4, 5].



To fragment XML documents, Ma et al. define a new
fragmentation type: split [13], which is inspired by the
object-oriented domain. This fragmentation splits XML
document elements and assigns a reference to each sub-element.
The references are then added to the Document
Type Definition (DTD) defining the XML document. Andrade
et al. propose to apply fragmentation to a homogeneous
XML collection [1]. They adapt traditional fragmentation
techniques to an XML document collection and base
their proposal on the Tree Logical Class algebra (TLC) [19].

Bose and Fegaras use XML fragments for data exchange
in a peer-to-peer network (P2P), called XP2P [5]. XML
fragments are interrelated and each is uniquely identified
by an ID. The authors propose a fragmentation schema,
called Tag Structure, to define the structure of data and
fragmentation information. Bonifati et al. also define XML
fragments for a P2P framework [4]. An XML fragment is
obtained and identified by a single path expression, a root-to-node
path expression XP, and managed on a specific peer.

In summary, these proposals adapt classical fragmentation 
methods to split XML data. An XML fragment is defined 
and identified by a path expression ^ or an XML algebra 
operator [T]. Fragmentation is performed on a single XML 
document [TH] or on an homogeneous XML collection pp. 

4. FRAGMENTING XML DATA 
WAREHOUSES 

4.1 Motivation 

Approaches dealing with fragmentation in XML databases
adopt only primary horizontal fragmentation applied to
one XML document (Section 3.3). They use fragmentation
to minimize XML query expression execution cost. However,
in XML data warehouses, decision-support queries are
more complex: they involve multiple join operations over
multiple XML (fact and dimension) documents. Hence, primary
horizontal fragmentation alone is not suited to our context.
Relational data warehouse fragmentation approaches
recommend using derived horizontal fragmentation (Section 3.2),
which is better suited to analytical queries. In
addition, there is, to the best of our knowledge, no work on XML
data warehouse fragmentation in the literature.
Consequently, we propose to adapt derived horizontal fragmentation
to XML data warehouses (Definition 2).

Definition 2. In an XML data warehouse, derived horizontal
fragmentation first horizontally splits the $G_{dimension_d}$
graphs with respect to a given workload $W$, and then partitions
the $G_{facts_f}$ graphs with respect to the $G_{dimension_d}$ fragments.

4.2 General principle 

In our fragmentation methodology, we first apply a primary
horizontal fragmentation onto warehouse dimensions
using either the predicate construction method, denoted PC,
or the affinity-based method, denoted AB (Section 3.2).
Both these methods input selection predicates (Definition 3)
from $W$ (Section 4.3.1). AB also exploits data access
frequencies. Our adaptations of PC and AB to the XML
context are described in Sections 4.3.2 and 4.3.3, respectively.
Both help fragment $G_{dimension_d}$ graphs. Note that
we consider both the PC and AB methods to compare their
efficiency, which has never been addressed in the literature
as far as we know. Based on these fragments, we then fragment
the $G_{facts_f}$ graphs and build a fragmentation schema
for the whole XML data warehouse. This process is detailed
in Section 4.4.

Definition 3. A selection predicate is defined by expression
$p := P_{a_k} \; \theta \; [\, value \mid \phi_{XPath}(P_{a_k}) \mid Q \,]$, where $P_{a_k}$ and
$Q$ are path expressions and $P_{a_k}$ is defined on attribute $a_k$,
$\theta \in \{=, <, >, \leq, \geq, \neq\}$, $value \in D_k$ where $D_k$ is the domain
of $a_k$, and $\phi_{XPath}$ is any XPath function¹.

4.3 Primary horizontal fragmentation 

4.3.1 Selection predicate extraction 

The set $P$ of selection predicates used to fragment the
$G_{dimension_d}$ graphs is identified by parsing $W$. For example,

    p1 := $y/attribute[@id='c_nation_key']/@value > '15'
    p2 := $y/attribute[@id='p_type']/@value = 'PROMO BURNISHED COPPER'

are selection predicates obtained from the sample XQuery
workload provided in Figure 2.

q1   for $x in //FactDoc/Fact,
         $y in //dimensions[@dim-id='Customer']/Level/instance,
         $z in //dimensions[@dim-id='Part']/Level/instance
     where $y/attribute[@id='c_nation_key']/@value='13'
       and $y/attribute[@id='p_type']/@value='PROMO BURNISHED COPPER'
       and $x/dimension[@dim-id='Customer']/@value-id=$y/@id
       and $x/dimension[@dim-id='Part']/@value-id=$z/@id
     return $x



q10  for $x in //FactDoc/Fact,
         $y in //dimensions[@dim-id='Customer']/Level/instance,
         $z in //dimensions[@dim-id='Date']/Level/instance
     where $y/attribute[@id='c_nation_key']/@value>'15'
       and $y/attribute[@id='d_date_name']/@value='Saturday'
       and $x/dimension[@dim-id='Customer']/@value-id=$y/@id
       and $x/dimension[@dim-id='Part']/@value-id=$z/@id
     return $x


Figure 2: Workload snapshot 
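
As an illustration of the extraction step, the following Python sketch parses the two queries of Figure 2 with a regular expression. It is a simplification made for this example only: it assumes that every selection predicate has the shape `$var/attribute[@id='name']/@value OP 'literal'`, whereas Definition 3 also allows XPath functions and path expressions on the right-hand side. The names `PREDICATE`, `extract_predicates` and `predicate_set` are hypothetical.

```python
import re

# Matches selection predicates of the form
#   $var/attribute[@id='name']/@value OP 'literal'
# Join predicates compare against another path ($y/@id), so they do not match.
PREDICATE = re.compile(
    r"\$\w+/attribute\[@id='([^']+)'\]/@value\s*(<=|>=|!=|=|<|>)\s*'([^']*)'"
)

WORKLOAD = {
    "q1": """
        for $x in //FactDoc/Fact,
            $y in //dimensions[@dim-id='Customer']/Level/instance,
            $z in //dimensions[@dim-id='Part']/Level/instance
        where $y/attribute[@id='c_nation_key']/@value='13'
          and $y/attribute[@id='p_type']/@value='PROMO BURNISHED COPPER'
          and $x/dimension[@dim-id='Customer']/@value-id=$y/@id
          and $x/dimension[@dim-id='Part']/@value-id=$z/@id
        return $x""",
    "q10": """
        for $x in //FactDoc/Fact,
            $y in //dimensions[@dim-id='Customer']/Level/instance,
            $z in //dimensions[@dim-id='Date']/Level/instance
        where $y/attribute[@id='c_nation_key']/@value>'15'
          and $y/attribute[@id='d_date_name']/@value='Saturday'
          and $x/dimension[@dim-id='Customer']/@value-id=$y/@id
          and $x/dimension[@dim-id='Part']/@value-id=$z/@id
        return $x""",
}


def extract_predicates(query):
    """Return the (attribute, operator, value) triples found in one query."""
    return [m.groups() for m in PREDICATE.finditer(query)]


def predicate_set(workload):
    """Build P: the distinct predicates of the workload, in first-seen order."""
    seen = {}
    for text in workload.values():
        for pred in extract_predicates(text):
            seen.setdefault(pred, None)
    return list(seen)


P = predicate_set(WORKLOAD)
for attribute, operator, value in P:
    print(attribute, operator, value)
```

A production parser would rather work on the XQuery syntax tree; the regular expression is used here only to keep the sketch short.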



4.3.2 PC primary horizontal fragmentation 
Principle 

Based on selection predicate set $P$ and metadata from $G_{dw\text{-}model}$,
PC identifies candidate $G_{dimension_d}$ graphs for fragmentation.
A candidate dimension graph $G_{candidate_d}$ is a $G_{dimension_d}$
graph targeted by workload queries.

For each candidate dimension and its corresponding selection
predicate set $P_d \subseteq P$, a set of complete and minimal
selection predicates $P'_d$ is generated with the COM-MIN algorithm
[18] that guarantees completeness and minimality
(Section 3.2). PC finally builds from $P'_d$ a set of minterms
that horizontally fragment the $G_{candidate_d}$ graphs.

Fragmentation methodology 

1. Attribution of selection predicates to dimension
XML graphs. This step assigns to each dimension
graph $G_{dimension_d}$ its corresponding selection
predicate set $P_d \subseteq P$. $P_d$ is identified from $G_{dw\text{-}model}$,
which stores for each dimension its corresponding attributes.
Hence, we can identify candidate dimension
graphs ($G_{candidate_d}$) for horizontal fragmentation.

Example. Predicate $p_1$ contains attribute c_nation_key.
In $G_{dw\text{-}model}$, c_nation_key is a member of the customer
dimension. Hence, we identify $G_{dimension_{customer}}$ as a
candidate dimension for fragmentation.

¹ http://www.w3.org/TR/xpath-functions/

2. Selection predicate completeness and minimality.
In this step, we apply the COM-MIN algorithm,
which inputs $P_d$ and outputs a set of complete and
minimal predicates $P'_d$. Given $P'_d$, the set $M_d$ of minterm
predicates is then constructed.

$M_d = \{ m_i \mid m_i = \bigwedge_{p_j \in P'_d} p_j^* \}$, where $p_j^* = p_j$ or $p_j^* = \neg p_j$,
$1 \leq j \leq n$, $1 \leq i \leq 2^n$ and $n$ represents the number
of selection predicates. A minterm predicate $m_i \in M_d$
is the conjunction of all predicates from $P'_d$, taken in
natural or negative form. $m_i = p_1 \wedge \neg p_2$ is an example
minterm predicate, where $p_1$ and $p_2$ are the sample
selection predicates from Section 4.3.1.

Example. Let $P_{customer}$ be the set composed of the two predicates

    $y/attribute[@id='c_nation_key']/@value = 13
    $y/attribute[@id='c_nation_key']/@value > 15

and assume it is a complete and minimal set obtained by the
COM-MIN algorithm for dimension customer. A minterm $m_i$ is

    $y/attribute[@id='c_nation_key']/@value = 13 and
    $y/attribute[@id='c_nation_key']/@value <= 15.

3. Candidate graph fragmentation. This step builds
primary horizontal fragments from $G_{candidate_d}$. A fragment
is obtained by associating to each minterm predicate
$m_i \in M_d$ the set of nodes in $G_{candidate_d}$ that
verifies it.

Example. Minterm $m_i$ is used to fragment $G_{candidate_{customer}}$.
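
Steps 2 and 3 can be sketched in Python as follows. The sketch assumes that a complete and minimal predicate set is already available (the COM-MIN algorithm itself is not reproduced), represents each predicate as a Boolean function over a dimension instance, and discards minterms that select no instance; the instance data and all function names are hypothetical.

```python
from itertools import product

# Hypothetical customer dimension instances: id -> member attribute values.
customers = {
    "c1": {"c_nation_key": 13},
    "c2": {"c_nation_key": 20},
    "c3": {"c_nation_key": 7},
    "c4": {"c_nation_key": 13},
}

# Assumed output of COM-MIN for the customer dimension.
predicates = {
    "p1": lambda inst: inst["c_nation_key"] == 13,
    "p2": lambda inst: inst["c_nation_key"] > 15,
}


def minterms(names):
    """Yield the 2^n minterms as tuples of (predicate name, natural form?)."""
    for signs in product((True, False), repeat=len(names)):
        yield tuple(zip(names, signs))


def fragment(instances, predicates):
    """Associate each satisfiable minterm with the instances that verify it."""
    names = sorted(predicates)
    fragments = {}
    for m in minterms(names):
        members = [
            ident
            for ident, inst in instances.items()
            if all(predicates[name](inst) == natural for name, natural in m)
        ]
        if members:  # contradictory minterms (e.g. = 13 and > 15) select nothing
            fragments[m] = members
    return fragments


frags = fragment(customers, predicates)
for m, members in frags.items():
    label = " and ".join(n if natural else "not " + n for n, natural in m)
    print(label, "->", members)
```

Because every instance evaluates each predicate to exactly one truth value, each instance falls into exactly one minterm, so the resulting fragments are disjoint and cover the whole dimension.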

4.3.3 AB primary horizontal fragmentation 
Principle 

AB uses query frequency to build horizontal fragments by
exploiting specific matrices (predicate usage and affinity matrices).
It clusters selection predicates from $P$ by exploiting
a graphical algorithm. A cluster is defined as a selection
predicate cycle and forms a fragment of a $G_{dimension_d}$ graph.

Fragmentation methodology 

1. Predicate usage matrix construction. The predicate
usage matrix, PUM, is built based on $P$. It
defines the predicate usage of each query $q_i \in W$. Matrix
lines represent workload queries and columns simple
selection predicates from $P$. General term $PUM(i, j)$
is set to one if $q_i$ includes predicate $p_j$ and to zero otherwise.
In addition, the usage frequency of each query
$q_i$ is stored in a vector Freq.

Example. Tables 1 and 2 provide examples of PUM
matrix and query frequency vector, respectively.

Table 1: Sample predicate usage matrix
($n$ represents the number of selection predicates in $P$ and $m$
the number of queries in $W$.)

Table 2: Sample query frequency vector

2. Predicate affinity matrix construction. The predicate
affinity matrix, Aff, is built from the PUM matrix
(Table 3). It is an $n \times n$ matrix, where $n$ represents
the number of selection predicates in $P$. Aff matrix
cells can contain numeric or nonnumeric ($\Rightarrow$, $\Leftarrow$ and
$*$) values. A numeric value of an $Aff(i, j)$ cell gives
the frequency sum of all queries referencing both predicates
$p_i$ and $p_j$. A value "$\Rightarrow$" indicates that predicate
$p_i$ implies predicate $p_j$; a value "$\Leftarrow$" indicates that
predicate $p_j$ implies predicate $p_i$; and a value "$*$" indicates
that predicates $p_i$ and $p_j$ are similar. Two predicates
$p_i$ and $p_j$ are similar if: (1) they are defined on
the same attribute; (2) there exists a query $q_i$ that uses
predicates $p_i$ and $p_c$ and another query $q_j$ that uses
predicates $p_j$ and $p_c$; and (3) $p_c$ is a selection predicate
that is defined on another attribute than the $p_i$ and $p_j$
predicates [16].

Example. Table 3 shows an example of affinity matrix.

Table 3: Sample predicate affinity matrix

3. Predicate clustering. This step exploits the graphical
algorithm proposed by Navathe et al. [15] for
vertical fragmentation, which has been adapted
for horizontal fragmentation [16]. This algorithm inputs
Aff and considers it as a complete graph, $G_{Aff}$.
Then, it forms a linearly connected spanning tree. A
tree node represents a selection predicate $p_i$ in $Aff(i, j)$
and an edge $e(p_i, p_j)$ an affinity value. The algorithm
detects and extracts a set of cycles $C$, where a cycle
$c_i \in C$ groups selection predicates sharing values in
Aff.

Example. Figure 3 shows a sample $G_{Aff}$ that is built
from $Aff(i, j)$. $C = \{c_1, c_2, \ldots, c_z\}$, where $z$ represents
the number of cycles, and $c_1 = \{p_1, p_3, p_5\}$.

Figure 3: Predicate clustering example

4. Compose predicate terms. Cycle set $C$ is first evaluated
to determine distinct common attributes in $C$
predicates and construct a specific table called predicate
term schematic table. This table stores attribute
usage for each $c_i$. Based on this table, predicate terms
$t_i$ are constructed. A predicate term $t_i$ constitutes a
horizontal entry in the predicate term schematic table
and covers all common attributes.

Example. Table 4 gives an example of predicate term
schematic table. Predicates in cycle $c_1$ do not include
attribute $a_2$. $c_1$ is hence divided into a set of
sub-cycles $c_{1j}$. Each $c_{1j}$ sub-cycle contains predicates
from $c_1$ and a predicate $p_j$ that includes attribute $a_2$.
$t_1 = p_1 \wedge p_3 \wedge p_5 \wedge p_2$ for $j = 2$ is an example of predicate
term.

Table 4: Predicate term schematic table
($a_o$ represents an attribute from dimension $d$ and $r$ is the
number of attributes in $C$.)



5. Candidate graph fragmentation. Each obtained
predicate term and an additional predicate, called ELSE,
form a horizontal fragment. The ELSE predicate is
the negation of the conjunction of all predicate terms.
It is added to ensure fragmentation completeness. To
ensure that fragments are disjoint, a set of minterms is
also created (Section 4.3.2).

Example. $t_1$ and ELSE $= \neg p_1 \vee \neg p_3 \vee \neg p_5 \vee \neg p_2$ are
predicate terms used to fragment $G_{dimension_{customer}}$.
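
The first two steps of the AB methodology lend themselves to a short sketch. The Python code below builds a predicate usage matrix and the numeric part of the affinity matrix for a hypothetical three-query workload; the implication and similarity markers and the clustering step are deliberately left out, and the query names, predicate names and frequencies are invented for the example.

```python
# Hypothetical workload: the predicates used by each query, and query frequencies.
queries = {"q1": {"p1", "p2"}, "q2": {"p1", "p3"}, "q3": {"p2"}}
freq = {"q1": 10, "q2": 5, "q3": 2}
preds = ["p1", "p2", "p3"]


def usage_matrix(queries, preds):
    """PUM: one row per query, one column per predicate; 1 if the query uses it."""
    return {q: [1 if p in used else 0 for p in preds] for q, used in queries.items()}


def affinity_matrix(pum, freq, n):
    """Numeric Aff(i, j): frequency sum of the queries referencing both p_i and p_j."""
    aff = [[0] * n for _ in range(n)]
    for q, row in pum.items():
        for i in range(n):
            for j in range(n):
                if row[i] and row[j]:
                    aff[i][j] += freq[q]
    return aff


pum = usage_matrix(queries, preds)
aff = affinity_matrix(pum, freq, len(preds))
for name, row in zip(preds, aff):
    print(name, row)
```

With this definition, a diagonal cell holds the total frequency of the queries that use the corresponding predicate, and the matrix is symmetric, which is what allows it to be read as an undirected weighted graph in the clustering step.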

4.4 Fact fragmentation 

The $G_{facts_f}$ graphs are finally fragmented according to
the horizontal fragments obtained by applying either the PC or
AB method on dimensions. The fragmentation of $G_{facts_f}$
graphs is achieved by semi-join operations based on a virtual
key reference. This key defines the relationships between
$G_{dimension_d}$ and $G_{facts_f}$ graphs. It is explicitly defined by
the join qualification expression provided in Figure 4 and
consists of a conjunction of two path expressions. These
path expressions check whether nodes in $G_{dimension_d}$ graphs
correspond to nodes in $G_{facts_f}$ graphs.
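
To make the semi-join concrete, here is a minimal Python sketch based on the standard `xml.etree.ElementTree` module. The two inline documents only approximate the reference model of Section 2.2 (element and attribute names such as `fact`, `dim-id` and `value-id` are taken from the figures, but the exact document layout is an assumption), and the helper names `dimension_fragment` and `fact_fragment` are hypothetical.

```python
import xml.etree.ElementTree as ET

FACTS_XML = """
<FactDoc>
  <fact id="f1"><dimension dim-id="Customer" value-id="c1"/></fact>
  <fact id="f2"><dimension dim-id="Customer" value-id="c2"/></fact>
  <fact id="f3"><dimension dim-id="Customer" value-id="c1"/></fact>
</FactDoc>
"""

CUSTOMER_XML = """
<dimension dim-id="Customer">
  <Level id="Customer">
    <instance id="c1"><attribute id="c_nation_key" value="13"/></instance>
    <instance id="c2"><attribute id="c_nation_key" value="20"/></instance>
  </Level>
</dimension>
"""


def dimension_fragment(dim_root, predicate):
    """Primary horizontal fragment: ids of the instances satisfying the predicate."""
    ids = set()
    for inst in dim_root.iter("instance"):
        values = {a.get("id"): a.get("value") for a in inst.findall("attribute")}
        if predicate(values):
            ids.add(inst.get("id"))
    return ids


def fact_fragment(fact_root, dim_id, instance_ids):
    """Semi-join: keep the facts that reference one of the given dimension instances."""
    frag = ET.Element("FactDoc")
    for fact in fact_root.findall("fact"):
        for ref in fact.findall("dimension"):
            if ref.get("dim-id") == dim_id and ref.get("value-id") in instance_ids:
                frag.append(fact)
                break
    return frag


facts = ET.fromstring(FACTS_XML)
customer = ET.fromstring(CUSTOMER_XML)

kept = dimension_fragment(customer, lambda v: int(v["c_nation_key"]) == 13)
rest = dimension_fragment(customer, lambda v: int(v["c_nation_key"]) != 13)
frag_13 = fact_fragment(facts, "Customer", kept)
frag_rest = fact_fragment(facts, "Customer", rest)
print([f.get("id") for f in frag_13], [f.get("id") for f in frag_rest])
```

The two tests performed in `fact_fragment` (matching dimension identifier, then matching instance identifier) play the role of the two path expressions of the join qualification.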

document($facts_f$.xml)/FactDoc/dimension[@dim-id =
    document($dimension_d$.xml)/dimension/Level/@id]
and
document($facts_f$.xml)/FactDoc/dimension[@value-id =
    document($dimension_d$.xml)/dimension/Level[@id
    = @dim-id]/instance/@id]

Figure 4: Join qualification

We finally build an XML document that represents the
fragmentation schema, fragmentation^schema.xml. Its corresponding
graph, denoted Schema, is provided in Figure 5.
The root node, Schema, is composed of fragment elements
describing the obtained fragments. Each fragment is identified
by an @id attribute and contains dimension elements.
A dimension element is identified by a @name attribute
and contains predicate elements that store minterms used
for fragmentation.
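
The structure of the fragmentation schema can be sketched with `xml.etree.ElementTree`. Only the element and attribute names (Schema, fragment, @id, dimension, @name, predicate) come from the description above; the predicate strings, the fragment identifiers and the lookup helper `fragments_for`, which matches predicates by exact text, are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET


def build_schema(fragments):
    """fragments: {fragment id: {dimension name: [predicate strings]}}."""
    root = ET.Element("Schema")
    for frag_id, dims in fragments.items():
        frag = ET.SubElement(root, "fragment", {"id": frag_id})
        for dim_name, preds in dims.items():
            dim = ET.SubElement(frag, "dimension", {"name": dim_name})
            for text in preds:
                ET.SubElement(dim, "predicate").text = text
    return root


def fragments_for(schema, dimension, predicate):
    """Ids of the fragments whose predicate list for `dimension` contains `predicate`."""
    return [
        f.get("id")
        for f in schema.findall("fragment")
        for d in f.findall("dimension")
        if d.get("name") == dimension
        and any(p.text == predicate for p in d.findall("predicate"))
    ]


# Hypothetical fragments built from two minterms on the Customer dimension.
schema = build_schema({
    "F1": {"Customer": ["c_nation_key = 13", "not(c_nation_key > 15)"]},
    "F2": {"Customer": ["not(c_nation_key = 13)", "c_nation_key > 15"]},
})

print(ET.tostring(schema, encoding="unicode"))
print(fragments_for(schema, "Customer", "c_nation_key > 15"))
```

A lookup of this kind is what allows a query to be routed only to the fragments it needs, although a real implementation would have to reason on predicate implication rather than on textual equality.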








Figure 5: Fragmentation schema 



5. EXPERIMENTS 

5.1 Experimental conditions 

In order to validate our proposal experimentally, we use 
XWeB (the XML Data Warehouse Benchmark) [ij. XWeB 
is based on the reference model defined in Section 2.2 and
proposes a test XML data warehouse and its associated 
XQuery decision-support workload. 

XWeB's warehouse consists of sale facts characterized by
the amount (of purchased products) and quantity (of purchased
products) measures. These facts are stored in the
$facts_{sales}$.xml document and are described by four dimensions:
Customer, Supplier, Date and Part, stored in the
$dimension_{Customer}$.xml, $dimension_{Supplier}$.xml,
$dimension_{Date}$.xml and $dimension_{Part}$.xml documents, respectively.
XWeB's warehouse characteristics are displayed in Table 5.

XWeB's workload is composed of queries that exploit the
warehouse through join and selection operations. We extend
this workload by adding queries and selection predicates in
order to obtain a significant fragmentation. Our workload
is available on-line². We ran our tests on a Pentium 2 GHz
PC with 1 GB of main memory and an IDE hard drive under
Windows XP. We use the X-Hive XML native DBMS³ to
store and query the warehouse.

5.2 Experiments 

Facts                         Number of cells
Sale facts                    7000

Dimensions                    Number of instances
Customer                      1000
Supplier                      1000
Date                          500
Part                          1000

Documents                     Size (MB)
$facts_{sales}$.xml           2.14
$dimension_{Customer}$.xml    0.431
$dimension_{Supplier}$.xml    0.485
$dimension_{Date}$.xml        0.104
$dimension_{Part}$.xml        0.388

Table 5: XWeB warehouse characteristics

² http://eric.univ-lyon2.fr/~hmahboubi/Workload/workload.xq
³ http://www.x-hive.com/products/db/

Our experiments measure workload execution time, with
and without using fragmentation, and separately evaluate
the PC and AB primary fragmentation strategies (Sections
4.3.2 and 4.3.3, respectively). The fragments we obtain
are stored in distinct collections to simulate data distribution.
Each collection can indeed be considered as a distinct
node/site and can be identified, targeted and queried separately.
To measure query execution time over a fragmented
warehouse, we first identify the required fragments with the
Schema graph. Then, we execute the query over each fragment
and record its execution time. To simulate a parallel execution,
we only consider the maximum execution time. We
conducted two series of experiments.
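
The measurement protocol can be summarized by the short Python sketch below. Taking the maximum per-fragment time follows the description above; the relative gain formula, however, is our reading of "gain over no fragmentation" and is an assumption, as are the function names and the timings, which are made-up values rather than measured results.

```python
def parallel_time(fragment_times):
    """Simulated parallel execution: the slowest fragment determines the response time."""
    return max(fragment_times)


def gain(t_unfragmented, t_fragmented):
    """Relative gain (in percent) of the fragmented run over the unfragmented one.

    Assumed definition: (t_unfragmented - t_fragmented) / t_unfragmented.
    """
    return 100.0 * (t_unfragmented - t_fragmented) / t_unfragmented


# Hypothetical timings (in seconds) for one query.
per_fragment = [1.2, 0.8, 2.0]
t_frag = parallel_time(per_fragment)
print(t_frag, gain(8.0, t_frag))
```

An average gain for a workload would then be obtained by averaging `gain` over all queries of the workload.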

5.2.1 First series of experiments 

This series of experiments helps observe the impact of
data warehouse size and workload characteristics on fragmentation
quality. For this purpose, we exploit three warehouse
and workload configurations (Table 6) in which we
vary warehouse size (i.e., the number of facts) and the number
of workload queries and selection predicates.





                             Config. 1   Config. 2   Config. 3
Number of facts              800         800         4000
Number of queries            13          19          19
Number of join operations    22          35          35
Number of predicates         20          30          30


Table 6: Warehouse and workload configurations 



Experiment results for configurations 1, 2 and 3 are
shown in Figures 6, 7 and 8, respectively. In these figures,
the X axis represents workload queries and the Y axis features
query execution time when no fragmentation is applied
on the warehouse, and when derived horizontal fragmentation
is applied with PC and AB primary fragmentation.

For configuration 1, we obtain an average gain over no 
fragmentation of 72.95% with PC and 76.32% with AB. For 
configuration 2, in which the number of queries is increased 
over configuration 1, PC improves query execution time by 
74.53% and AB by 78.32% on average. Finally, in configuration 3,
we increase the number of facts and obtain an
average gain of 62.59% with PC and 80.17% with AB.

These results confirm that fragmentation improves query
performance. They also show that AB provides more benefit
than PC in all our test cases. We think this is thanks to
AB's use of query frequencies to group in the same fragment
all dimension instances and facts needed to perform a given
join operation. In addition, we notice that PC fragmentation
gain significantly declines in configuration 3, i.e., when
warehouse size increases. We think this is due to the number
of fragments produced by PC, which is greater than that
obtained with AB (159 and 119, respectively). To further
investigate this issue, we conduct a second series of experiments
where we further observe PC and AB gain variation
with respect to warehouse size.

Figure 6: Configuration 1 results

Figure 7: Configuration 2 results

5.2.2 Second series of experiments 

This series of experiments aims at observing the effect
of warehouse size on fragmentation gain. We vary warehouse
size from 1000 to 5000 facts and measure the fragmentation
gain achieved when using PC and AB primary fragmentation.
The results of these experiments are plotted in Figure 9,
whose X axis represents the number of facts and Y
axis the corresponding gains obtained by PC and AB primary
fragmentation.

Experiment results show that fragmentation gain declines
when warehouse size increases with both primary fragmentation
methods. This is expected, since fragments become
bigger and bigger, inducing a higher and higher scan cost
when performing join operations. However, we also observe
that performance degradation is reasonably slow for AB,
while it is much steeper for PC. We believe that this is because
AB builds fragments containing data required to perform
the most frequent join operations in $W$, while storing
less frequently accessed data in the ELSE fragment. PC
does not take this aspect into account. It just groups in
the same fragment data accessed by one or more queries
simultaneously. It also uses minterms that may distribute
data required to answer a single query in different fragments,
which multiplies reconstruction joins when accessing data.

Figure 8: Configuration 3 results

Figure 9: Fragmentation gain vs. warehouse size



6. CONCLUSION 

In this paper, we have adapted to the XML context and compared
the two prevailing primary horizontal fragmentation
methods from the relational world, namely predicate construction
and affinity-based fragmentation. We have experimentally
confirmed that derived horizontal fragmentation
helps improve query response time significantly. Moreover,
we have also shown that affinity-based fragmentation clearly outperforms
predicate construction in all our experiments,
which had never been demonstrated before as far as we
know, even in the relational context.

The natural follow-up of this work is to distribute frag- 
mented XML warehouses on a data grid. This raises several 
issues that include processing a global query into subqueries 
to be sent to the right nodes in the grid, and reconstructing 
a global result from subquery results. Properly indexing the 
distributed warehouse to guarantee good performance shall 
also be very important. 